[Day 21] Regularization - part II - iT 邦幫忙

11th iThome Ironman Contest

DAY 21

AI & Data

Learning from top Kagglers how to win data analysis competitions, series part 21


madeleine

2019-09-22 21:40:22


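Extensions and generalization: handling four kinds of data situations

Mean encoding in regression and multiclass tasks
Using many-to-many relations
Time series
Feature-interaction encoding and numerical features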

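Regression and multiclass

Regression is the simpler case: the target can be summarized directly with statistics such as the median, percentiles, standard deviation, distribution bins, or a histogram.
For multiclass tasks, N is the number of classes, and we build N encodings, one per class. The older approach feeds a separate model for each class, but such a model cannot see the structure of the whole dataset, because all the other classes get lumped together into a single group.
For mean encoding across different classes, see the next section, which covers the many-to-many approach.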
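To make the two cases above concrete, here is a minimal sketch on a made-up table. The column names (`city`, `price`, `label`) and the data are my own assumptions, not from the course, and no regularization is applied yet.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature, a regression target,
# and a three-class label.
df = pd.DataFrame({
    "city":  ["A", "A", "A", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0, 8.0],   # regression target
    "label": [0, 1, 2, 0, 0, 1],                   # multiclass target
})

# Regression: summarize the target per category with several statistics.
reg_stats = df.groupby("city")["price"].agg(["mean", "median", "std"])
reg_enc = df[["city"]].join(reg_stats.add_prefix("price_"), on="city")

# Multiclass: one encoding per class, i.e. the share of rows in each
# category that belong to class k (N classes give N new columns).
class_share = pd.crosstab(df["city"], df["label"], normalize="index")
class_share.columns = [f"label_{k}_enc" for k in class_share.columns]
multi_enc = df[["city"]].join(class_share, on="city")

print(reg_enc)
print(multi_enc)
```

A category with a single row (city C here) has an undefined standard deviation, so `price_std` is NaN for it and would need filling before modelling.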

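Many-to-many relations

Consider a user who owns many products (for example, one smartphone has many apps installed), while each product in turn belongs to many users: a many-to-many relation.
Once the categories are turned into a vector, statistics such as the minimum, maximum, mean, median, percentiles, standard deviation, distribution, or histogram can be applied to it.
The example below looks at which apps customers download. On the left, each user has downloaded several apps; on the right, the same data is shown in long form (one row per user-app pair). After mean encoding each app, a user's apps become a vector such as (0.1, 0.2, 0.1), which can then be summarized with statistics.

Screenshot from Coursera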
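The user-app idea can be sketched roughly as follows. The table, the column names (`user_id`, `app_id`, `target`) and the choice of summary statistics are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical long-form data: one row per (user, app) pair,
# with the user's binary target repeated on each of their rows.
long_df = pd.DataFrame({
    "user_id": [10, 10, 10, 11, 11, 12],
    "app_id":  ["a", "b", "c", "a", "c", "b"],
    "target":  [1, 1, 1, 0, 0, 1],
})

# Mean encode each app: average target over all users who have it.
app_enc = long_df.groupby("app_id")["target"].mean()
long_df["app_enc"] = long_df["app_id"].map(app_enc)

# Each user now has a vector of app encodings; summarize it per user.
user_feats = long_df.groupby("user_id")["app_enc"].agg(["min", "max", "mean", "std"])
print(user_feats)
```

In a real pipeline the app encodings would be estimated with one of the regularization schemes from part I, since here every user's own target leaks into their features.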

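Time series

Time-structured data makes it possible to build many complex features.
Rolling statistics of the target variable can be computed to produce new features.
The example below predicts what customers will spend money on. It covers two days, two users, and three spending categories. On day 1, user 101 spent $6 and user 102 spent $3, so those previous-day figures are carried forward as the feature values for day 2. The average amount per category can likewise be computed and added as a new feature.

Screenshot from Coursera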
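Here is a small sketch of the previous-day feature from the example. The transaction rows are invented so that the day-1 totals match the $6 and $3 in the text, and it assumes every user has a row for each consecutive day.

```python
import pandas as pd

# Hypothetical transactions: two days, two users, three categories.
tx = pd.DataFrame({
    "day":      [1, 1, 1, 2, 2, 2],
    "user":     [101, 101, 102, 101, 102, 102],
    "category": ["a", "b", "a", "c", "a", "b"],
    "amount":   [2.0, 4.0, 3.0, 5.0, 1.0, 2.0],
})

# Total spent per user per day.
daily = tx.groupby(["user", "day"], as_index=False)["amount"].sum()

# Shift by one day within each user so a row only sees the past.
daily["prev_day_spend"] = daily.groupby("user")["amount"].shift(1)

# Attach the lagged value back to the transaction rows.
tx = tx.merge(daily[["user", "day", "prev_day_spend"]], on=["user", "day"], how="left")
print(tx)
```

Day-1 rows have no history, so their `prev_day_spend` is NaN. The same pattern with `["user", "category", "day"]` as the grouping keys would give a per-category version.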

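Interactions and numerical features

Numerical features are handled by binning them first.
Binning and selecting feature interactions go together.
The question is how to choose the bins and how to find good feature combinations. A decision tree helps here: if two features sit in neighboring nodes of a tree, they interact, and we can count how often each pair interacts. The most frequently interacting pairs are candidates for mean encoding; for example, concatenate the two features of the most frequent pair and mean encode the result.

Screenshot from Coursera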
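The binning and concatenation steps might look like the sketch below. The features (`age`, `device`), the bin edges, and the data are all assumptions; choosing which pair to combine from tree statistics is not shown.

```python
import pandas as pd

# Hypothetical data: one numeric feature, one categorical feature, binary target.
df = pd.DataFrame({
    "age":    [18, 25, 33, 41, 58, 62],
    "device": ["ios", "android", "ios", "ios", "android", "android"],
    "target": [0, 1, 1, 0, 1, 1],
})

# Bin the numeric feature so it can be treated as a category.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "mid", "senior"])

# Concatenate the two features into a single interaction feature.
df["age_device"] = df["age_bin"].astype(str) + "_" + df["device"]

# Mean encode the interaction (no regularization here).
df["age_device_enc"] = df.groupby("age_device")["target"].transform("mean")
print(df[["age_device", "age_device_enc"]])
```

Interaction categories are sparser than the original features, so regularization matters even more for them than for a single column.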

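[extra] The instructor mentioned that in the Amazon Employee Access Challenge competition, mean encoding alone only reached an AUC of 0.87, roughly 700th place. Switching to a CatBoost model later reached 0.91, and that was enough to win. He cautioned, though, that CatBoost is not a cure-all either: interactions between features still have to be handled by hand.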

The instructor's reminder: the correct validation steps

Local experiments
- Estimate encodings on X_tr
- Map them to X_tr and X_val
- Regularize on X_tr
- Validate model on X_tr/ X_val split
Submission:
- Estimate encodings on whole Train data
- Map them to Train and Test
- Regularize on Train
- Fit on Train
  
Screenshot from Coursera
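A rough sketch of the local-experiment half of the checklist above, on random made-up data. The column names, the 75/25 split, and the use of out-of-fold means as the regularizer are my assumptions; any of the schemes from part I could be swapped in.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data with one categorical feature and a binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": rng.integers(0, 5, size=200),
    "target": rng.integers(0, 2, size=200),
})

X_tr, X_val = train_test_split(df, test_size=0.25, random_state=0)
X_tr, X_val = X_tr.copy(), X_val.copy()
prior = X_tr["target"].mean()

# Estimate encodings on X_tr only, then map them to X_val.
means_tr = X_tr.groupby("cat")["target"].mean()
X_val["cat_enc"] = X_val["cat"].map(means_tr).fillna(prior)

# Regularize on X_tr: out-of-fold encodings, so no training row
# sees its own target.
X_tr["cat_enc"] = np.nan
enc_col = X_tr.columns.get_loc("cat_enc")
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=False).split(X_tr):
    fold_means = X_tr.iloc[fit_idx].groupby("cat")["target"].mean()
    X_tr.iloc[enc_idx, enc_col] = X_tr["cat"].iloc[enc_idx].map(fold_means).values
X_tr["cat_enc"] = X_tr["cat_enc"].fillna(prior)

# A model would now be fit on X_tr and scored on X_val.
```

For the submission half, the same steps would be repeated with the whole Train set in place of `X_tr` and Test in place of `X_val`. The point is that the validation targets never touch the encoding.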

Hands-on mean encodings

The instructor's hands-on mean encodings notebook is linked here:
https://hub.coursera-apps.org/connect/eyusuwbavdctmvzkdnmwro?forceRefresh=false&token=s1i9boL9de22SaUVCfpJ&path=%2Fnotebooks%2FProgramming%2520assignment%252C%2520week%25203%253A%2520Mean%2520encodings%2FProgramming_assignment_week_3.ipynb

If you would rather not click through, I have copied the whole notebook below.

In this programming assignment you will be working with 1C dataset from the final competition. You are asked to encode item_id in 4 different ways:

Via KFold scheme;
Via Leave-one-out scheme;
Via smoothing scheme;
Via expanding mean scheme.
You will need to submit the correlation coefficient between resulting encoding and target variable up to 4 decimal places.
General tips
Fill NANs in the encoding with 0.3343.
Some encoding schemes depend on sorting order, so in order to avoid confusion, please use the following code snippet to construct the data frame. This snippet also implements mean encoding without regularization.

import pandas as pd
import numpy as np
from itertools import product
from grader import Grader

Read data

sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')

Aggregate data

Since the competition task is to make a monthly prediction, we need to aggregate the data to monthly level before doing any encodings. The following code cell serves just that purpose.

index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

# get aggregated values for (shop_id, item_id, month)
# Named aggregation (pandas >= 0.25) replaces the nested-dict renaming form,
# which newer pandas versions reject. The resulting columns are already flat,
# so no column-name fix-up is needed afterwards.
gb = sales.groupby(index_cols, as_index=False).agg(target=('item_cnt_day', 'sum'))
#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)

Mean encodings without regularization

After we did the technical work, we are ready to actually mean encode the desired item_id variable.
Here are two ways to implement mean encoding features without any regularization. You can use this code as a starting point to implement regularized techniques.

Method 1

# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

Method 2

'''
   Unlike `.target.mean()`, `transform` returns a Series
   with the same index as `all_data`.
   This single line of code is equivalent to the first two lines of Method 1.
'''
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

See the printed value? It is the correlation coefficient between the target variable and your new encoded feature. For each encoding scheme you implement below, you need to compute this correlation coefficient and submit it to Coursera.

grader = Grader()

1. KFold scheme

Explained starting at 41 sec of Regularization video.
Now it's your turn to write the code!
You may use 'Regularization' video as a reference for all further tasks.
First, implement KFold scheme with five folds. Use KFold(5) from sklearn.model_selection.
Split your data in 5 folds with sklearn.model_selection.KFold with shuffle=False argument.
Iterate through folds: use all but the current fold to calculate mean target for each level item_id, and fill the current fold.
See the Method 1 from the example implementation. In particular learn what map and pd.Series.map functions do. They are pretty handy in many situations.
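As a warm-up before touching `all_data`, the fold loop can be tried on a tiny made-up frame. This is only a sketch: the `toy` frame and its values are hypothetical, and the fill value here is the toy global mean rather than the 0.3343 the assignment asks for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy frame standing in for `all_data`.
toy = pd.DataFrame({
    "item_id": [1, 1, 2, 2, 1, 2, 1, 2, 1, 2],
    "target":  [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})
global_mean = toy["target"].mean()  # the assignment uses 0.3343 instead

toy["item_target_enc"] = np.nan
enc_col = toy.columns.get_loc("item_target_enc")
for fit_idx, cur_idx in KFold(n_splits=5, shuffle=False).split(toy):
    # Means come from every fold except the current one.
    means = toy.iloc[fit_idx].groupby("item_id")["target"].mean()
    toy.iloc[cur_idx, enc_col] = toy["item_id"].iloc[cur_idx].map(means).values

# Categories unseen outside the current fold end up as NaN.
toy["item_target_enc"] = toy["item_target_enc"].fillna(global_mean)
print(toy)
```

With `shuffle=False` the folds are consecutive blocks of rows, which is why the sort order of the data frame matters for the final correlation.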

# YOUR CODE GOES HERE

# You will need to compute correlation like that
corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('KFold_scheme', corr)

2. Leave-one-out scheme

Now, implement leave-one-out scheme. Note that if you just simply set the number of folds to the number of samples and run the code from the KFold scheme, you will probably wait for a very long time.
To implement a faster version, note, that to calculate mean target value using all the objects but one given object, you can:
1. Calculate sum of the target values using all the objects.
2. Then subtract the target of the given object and divide the resulting value by n_objects - 1.
Note that you do not need to perform 1. for every object. And 2. can be implemented without any for loop.
It is the most convenient to use .transform function as in Method 2.
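The two steps can be sketched without any loop, again on a hypothetical toy frame. How single-row categories are filled is my assumption (toy global mean here; the assignment's general tip is 0.3343).

```python
import pandas as pd

# Hypothetical toy frame standing in for `all_data`.
toy = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2, 3],
    "target":  [1.0, 0.0, 2.0, 4.0, 2.0, 5.0],
})

grp = toy.groupby("item_id")["target"]
cat_sum = grp.transform("sum")      # step 1: sum over the category, computed once
cat_count = grp.transform("count")  # number of objects in the category

# Step 2: remove the row's own target and divide by (n_objects - 1).
# A category with a single row gives 0/0 = NaN, which is filled afterwards.
toy["item_target_enc"] = (cat_sum - toy["target"]) / (cat_count - 1)
toy["item_target_enc"] = toy["item_target_enc"].fillna(toy["target"].mean())
print(toy)
```

Because `transform` returns a Series aligned with the original rows, both steps are plain vectorized arithmetic.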

# YOUR CODE GOES HERE

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Leave-one-out_scheme', corr)

3. Smoothing

Explained starting at 4:03 of Regularization video.
Next, implement smoothing scheme with α=100. Use the formula from the first slide in the video and 0.3343 as globalmean. Note that nrows is the number of objects that belong to a certain category (not the number of rows in the dataset).
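The slide itself is not reproduced in these notes. The sketch below assumes the usual smoothing formula, (mean(target) * nrows + globalmean * alpha) / (nrows + alpha), and uses a toy frame with a small alpha so the effect is visible.

```python
import pandas as pd

# Hypothetical toy frame standing in for `all_data`.
toy = pd.DataFrame({
    "item_id": [1, 1, 1, 1, 2],
    "target":  [1.0, 1.0, 0.0, 0.0, 1.0],
})
alpha = 2.0                          # the assignment uses alpha = 100
global_mean = toy["target"].mean()   # the assignment uses 0.3343

grp = toy.groupby("item_id")["target"]
cat_mean = grp.transform("mean")
nrows = grp.transform("count")       # rows in this category, not in the dataset

# Assumed formula: (mean * nrows + globalmean * alpha) / (nrows + alpha)
toy["item_target_enc"] = (cat_mean * nrows + global_mean * alpha) / (nrows + alpha)
print(toy)
```

The rarer a category, the closer its encoding is pulled toward the global mean: item 2, with a single row, moves from 1.0 to about 0.73, while item 1 barely moves from 0.5.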

# YOUR CODE GOES HERE

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Smoothing_scheme', corr)

4. Expanding mean scheme

Explained starting at 5:50 of Regularization video.
Finally, implement the expanding mean scheme. It is basically already implemented for you in the video, but you can challenge yourself and try to implement it yourself. You will need cumsum and cumcount functions from pandas.
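For reference, a minimal sketch of the expanding mean with `cumsum` and `cumcount` on a hypothetical toy frame. It is not necessarily identical to the code shown in the video, and the fill value for the first row of each category is an assumption.

```python
import pandas as pd

# Hypothetical toy frame standing in for `all_data`; row order matters.
toy = pd.DataFrame({
    "item_id": [1, 1, 2, 1, 2, 1],
    "target":  [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

grp = toy.groupby("item_id")["target"]
cumsum_before = grp.cumsum() - toy["target"]  # sum of earlier rows only
cumcnt_before = grp.cumcount()                # number of earlier rows

# The first row of each category has no history: 0/0 = NaN, filled afterwards.
toy["item_target_enc"] = cumsum_before / cumcnt_before
toy["item_target_enc"] = toy["item_target_enc"].fillna(toy["target"].mean())
print(toy)
```

Each row is encoded using only the rows that come before it, so the result depends on the sort order, which is why the notebook fixes it up front.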

# YOUR CODE GOES HERE

corr = np.corrcoef(all_data['target'].values, encoded_feature)[0][1]
print(corr)
grader.submit_tag('Expanding_mean_scheme', corr)

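[From my own little universe, unrelated to Coursera or Kaggle]
My biostatistics homework is already overdue, so all I can do is take things one at a time and push to finish the 30 days. I do biostatistics homework before work in the morning, then watch Coursera and write the Ironman post after work. Ever since day one, every single post has been a battle with myself. Nine days left! Hold on!! Keep going!!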
